16. Quality: Programmatic Assessment 2
Quality Programmatic Assessment 2
Quiz
Using the results of the programmatic assessment in the Jupyter Notebook below, identify the results that are indicative of data quality issues in the following quizzes.
Quality: Programmatic Assessment
SOLUTION:
- Value count for the *surname* 'Doe' is 6
- 'Jake Jakobsen' is a duplicated name
- Lowest weight is 48.8 lbs
- No null entries are returned from `sum` and `isnull` on the *auralin* and *novodra* columns
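The four checks behind these answers can be sketched in pandas roughly as follows. This is a minimal illustration on invented toy data, not the course notebook: the *given_name* column name and every value are assumptions. It also assumes, purely for illustration, that missing doses are stored as the placeholder string '-' rather than as NaN, which is one way a null check could report zero while values are still missing.

```python
import pandas as pd

# Toy stand-ins for the course tables; all values are invented for illustration.
patients = pd.DataFrame({
    "given_name": ["John", "John", "Jake", "Jake", "Ann"],  # assumed column name
    "surname": ["Doe", "Doe", "Jakobsen", "Jakobsen", "Lee"],
    "weight": [180.0, 180.0, 155.0, 155.0, 48.8],
})
treatments = pd.DataFrame({
    # Assumption: '-' is a placeholder string standing in for a missing dose.
    "auralin": ["41u - 48u", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "-"],
})

# 1. Count how often each surname occurs; a default name stands out by frequency.
surname_counts = patients["surname"].value_counts()

# 2. Full names that repeat an earlier row (the first occurrence is not flagged).
full_name = patients["given_name"] + " " + patients["surname"]
duplicated_names = full_name[full_name.duplicated()]

# 3. Smallest recorded weight; an implausibly low value deserves a closer look.
lowest_weight = patients["weight"].min()

# 4. Null counts per column; placeholder strings are not counted as null.
null_counts = treatments[["auralin", "novodra"]].isnull().sum()

print(surname_counts)
print(duplicated_names)
print(lowest_weight)
print(null_counts)
```

A zero from `isnull().sum()` only says that no value is NaN or None; it says nothing about placeholder strings, so a visual check of the column is still needed.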
Workspace
This section contains a workspace (a Jupyter Notebook workspace, an online code editor workspace, etc.) that cannot be automatically downloaded and reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files to https://github.com/udacity, so you may be able to download them there.
Solution
Quality Programmatic Assessment 2 Solution
*Note: while the default John Doe data is a validity issue as described in the video, it is also a completeness issue because this default data displaced real patient data that is no longer in the `patients` table. Because completeness is more "severe" than validity, completeness is likely the more appropriate data quality dimension. This distinction is worth noting because missing data is usually best addressed first when cleaning data, as you'll experience in Lesson 4. However, let's assume that this overwritten data can't be recovered, which makes treating it as a validity issue okay.*
'Elizabeth Knudsen' being a duplicated name isn't a data quality issue because 'Elizabeth Knudsen' is not a duplicated name. Her demographic information, which is filled with NaN entries, is duplicated, though, since there are other patient records with missing address, city, state, etc. information.
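The difference between a duplicated name and duplicated NaN-filled demographic fields can be reproduced on a small example. The sketch below uses invented rows and an assumed *given_name* column; it relies on the fact that pandas' `duplicated` treats NaN values as equal to one another, so rows with a missing address are flagged as duplicates of each other.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; 'given_name' is an assumed column name.
patients = pd.DataFrame({
    "given_name": ["Maria", "Jake", "Jake", "Elizabeth"],
    "surname": ["Silva", "Jakobsen", "Jakobsen", "Knudsen"],
    "address": [np.nan, "12 Oak St", "12 Oak St", np.nan],
})

# Duplicates judged by name: only the second 'Jake Jakobsen' row is flagged.
name_dupes = patients.duplicated(subset=["given_name", "surname"])

# Duplicates judged by address: the second NaN address is flagged as well,
# because duplicated() treats NaN entries as equal to each other.
address_dupes = patients["address"].duplicated()

# Keep only duplicates whose address is actually present.
real_address_dupes = patients[address_dupes & patients["address"].notnull()]

print(name_dupes.tolist())
print(address_dupes.tolist())
print(real_address_dupes)
```

Filtering on `notnull()` before interpreting the result is one way to separate genuine duplicates from rows that merely share missing values.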
The indexes of the series returned by `sort_values` on the *weight* column of the *patients* table are supposed to be out of order, since the original dataset isn't sorted by weight.
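A short sketch shows why the index looks scrambled after sorting: `sort_values` reorders the rows but keeps each row's original index label. The weights below are invented for illustration.

```python
import pandas as pd

# Invented weights; the default index is 0, 1, 2, 3 in row order.
patients = pd.DataFrame({"weight": [192.3, 48.8, 150.0, 121.7]})

# Sorting reorders the values and carries the original index labels along.
sorted_weights = patients["weight"].sort_values()
print(sorted_weights.index.tolist())

# If a fresh 0..n-1 index is wanted, it has to be requested explicitly.
renumbered = sorted_weights.reset_index(drop=True)
print(renumbered.index.tolist())
```

The out-of-order labels are useful in practice, since they point back to the rows in the unsorted table where each value came from.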